Summarization Evaluation: Correlating Human Performance on an Extrinsic Task with Automatic Intrinsic Metrics

Author

  • Stacy F. Hobson
Abstract

Title of dissertation: Text Summarization Evaluation: Correlating Human Performance on an Extrinsic Task with Automatic Intrinsic Metrics
Stacy F. Hobson, Doctor of Philosophy, 2007
Dissertation directed by: Professor Bonnie J. Dorr, Department of Computer Science

Text summarization evaluation is the process of assessing the quality of an individual summary produced by human or automatic methods. Many techniques have been proposed for text summarization, and researchers require an easy and uniform method for evaluating their summarization systems. Human evaluations are often costly, labor-intensive, and time-consuming, but are known to produce the most accurate results. Automatic evaluations are fast, easy to use, and reusable, but the quality of their results has not been independently shown to be similar to that of human evaluations. This thesis introduces a new human task-based summarization evaluation measure called Relevance Prediction that is a more intuitive measure of an individual's performance on a real-world task than agreement based on external judgments. Relevance Prediction parallels what a user does in the real-world task of browsing a set of documents using standard search tools: the user judges relevance based on a short summary, and then that same user (not an independent user) decides whether to open and judge the corresponding document. This measure is shown to be a more reliable measure of task performance than LDC Agreement, a current external gold-standard-based measure used in the summarization evaluation community. Six experimental studies are conducted to examine the existence of correlations between the human task-based evaluations of text summarization and the output of current intrinsic automatic evaluation metrics. The experimental results indicate that moderate, yet consistent, correlations exist between the Relevance Prediction method and the ROUGE metric for single-document summarization.
This work also formally establishes the usefulness of text summarization in reducing task time while maintaining a level of task judgment accuracy similar to that achieved with the full text documents.
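The two measures contrasted above can be made concrete with a short sketch. Relevance Prediction scores agreement between a subject's summary-based relevance judgment and that same subject's judgment on the corresponding full document, while LDC Agreement compares the summary-based judgment to an external gold-standard annotation. The function names and the boolean-judgment representation below are illustrative assumptions, not details taken from the dissertation:

```python
from typing import List, Tuple


def relevance_prediction(judgments: List[Tuple[bool, bool]]) -> float:
    """Fraction of items on which a subject's summary-based relevance
    judgment matches that same subject's judgment on the full document.

    Each pair is (summary_judgment, document_judgment) for one item,
    both made by the same person.
    """
    if not judgments:
        return 0.0
    return sum(s == d for s, d in judgments) / len(judgments)


def ldc_agreement(summary_judgments: List[bool],
                  gold_judgments: List[bool]) -> float:
    """Fraction of a subject's summary-based judgments that match an
    external gold-standard (e.g., LDC annotator) relevance judgment."""
    pairs = list(zip(summary_judgments, gold_judgments))
    if not pairs:
        return 0.0
    return sum(s == g for s, g in pairs) / len(pairs)


# Hypothetical data: four items judged by one subject.
same_user = [(True, True), (True, False), (False, False), (False, False)]
print(relevance_prediction(same_user))            # agreement with self
print(ldc_agreement([True, False], [True, True]))  # agreement with gold
```

The key design difference is that Relevance Prediction never needs an external annotator: both judgments in each pair come from the same person, which is what the thesis argues makes it a more intuitive measure of real-world task performance.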


Similar Articles

Extrinsic Evaluation of Automatic Metrics for Summarization

This paper describes extrinsic-task evaluation of summarization. We show that it is possible to save time using summaries for relevance assessment without adversely impacting the degree of accuracy that would be possible with full documents. In addition, we demonstrate that the extrinsic task we have selected exhibits a high degree of interannotator agreement, i.e., consistent relevance decisio...


An Intelligent, Semantics-Oriented System for Evaluating Text Summarization Systems

Summarizers and machine translators have attracted much attention, and many efforts to build such tools have been undertaken around the world. For Farsi, as for other languages, there have been efforts in this field, so evaluating such tools is of great importance. Human evaluations of machine summarization are thorough but expensive. Human evaluations can take months to f...


A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. Sifting through such a massive amount of data to extract the useful information is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...


Intrinsic vs. Extrinsic Evaluation Measures for Referring Expression Generation

In this paper we present research in which we apply (i) the kind of intrinsic evaluation metrics that are characteristic of current comparative HLT evaluation, and (ii) extrinsic, human task-performance evaluations more in keeping with NLG traditions, to 15 systems implementing a language generation task. We analyse the evaluation results and find that there are no significant correlations betw...


Using Speech-Specific Characteristics for Automatic Speech Summarization

In this thesis we address the challenge of automatically summarizing spontaneous, multi-party spoken dialogues. The experimental hypothesis is that it is advantageous when summarizing such meeting speech to exploit a variety of speech-specific characteristics, rather than simply treating the task as text summarization with a noisy transcript. We begin by investigating which term-weighting metri...



Publication date: 2007